Hyperparameter Tuning with Optuna
With great models comes the great problem of optimizing hyperparameters [Tha20]. Once a good search algorithm is established for hyperparameter optimization, the task becomes an engineering problem 1. Hence, we will explore an open-source library that offers a framework for solving this task.
Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to this define-by-run API, code written with Optuna is highly modular, and the user can construct the search space for the hyperparameters dynamically.
Basics with scikit-learn
Optuna is a black-box optimizer: it only needs an objective function, i.e. any function that returns a numerical value, to evaluate the performance of a set of parameters and to decide where to sample in upcoming trials. An optimization problem is framed in the Optuna API using two basic concepts: study and trial.
A study is conceptually an optimization based on an objective function, while a trial is a single execution of an objective function. The combination of hyperparameters for each trial is sampled according to some sampling algorithm defined by the study.
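To make the two concepts concrete, here is a minimal standard-library sketch of a study loop. It is a toy illustration, not Optuna's implementation: `ToyTrial`, `toy_optimize` and the example objective are hypothetical names, and sampling is plain uniform random. It shows one trial being one call of the objective, and a search space that is built while the objective runs.

```python
import random


class ToyTrial:
    """Hypothetical stand-in for a trial: samples and records parameters."""

    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def suggest_categorical(self, name, choices):
        self.params[name] = self.rng.choice(choices)
        return self.params[name]

    def suggest_float(self, name, low, high):
        self.params[name] = self.rng.uniform(low, high)
        return self.params[name]


def toy_optimize(objective, n_trials, seed=0):
    """Run the objective n_trials times; return (best_value, best_params)."""
    rng = random.Random(seed)
    best_value, best_params = float("-inf"), None
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        value = objective(trial)  # one trial = one call of the objective
        if value > best_value:
            best_value, best_params = value, dict(trial.params)
    return best_value, best_params


def objective(trial):
    # The search space depends on an earlier suggestion: 'x' exists only
    # on the 'quadratic' branch and 'y' only on the 'linear' branch.
    kind = trial.suggest_categorical("kind", ["quadratic", "linear"])
    if kind == "quadratic":
        x = trial.suggest_float("x", -5.0, 5.0)
        return -(x - 2.0) ** 2
    y = trial.suggest_float("y", 0.0, 1.0)
    return y - 3.0


best_value, best_params = toy_optimize(objective, n_trials=50)
```

A real sampler would use the history of earlier trials to decide where to sample next; this sketch ignores it.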
In the following code example, the search space is constructed within imperative Python code, e.g. inside conditionals or loops. In contrast, recall that for GridSearchCV and RandomizedSearchCV in scikit-learn, we had to define the entire search space before running the search algorithm.
!pip install optuna
import optuna
import pandas as pd
from sklearn import ensemble, svm
from sklearn import datasets
from sklearn import model_selection
from functools import partial
import joblib

# [1] Define an objective function to be maximized.
def objective(trial, X, y):
    # [2] Suggest values for the hyperparameters using trial object.
    clf_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if clf_name == 'SVC':
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        svc_c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)
        clf = svm.SVC(C=svc_c, gamma='auto')
    else:
        rf_max_depth = int(trial.suggest_float('rf_max_depth', 2, 32, log=True))
        clf = ensemble.RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)
    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()

# [3] Create a study object and optimize the objective function.
X, y = datasets.load_breast_cancer(return_X_y=True)
study = optuna.create_study(direction="maximize")
study.optimize(partial(objective, X=X, y=y), n_trials=5)
Requirement already satisfied: optuna in /usr/local/lib/python3.7/dist-packages (2.9.1)
[I 2021-09-23 17:54:39,422] A new study created in memory with name: no-name-e4f2c3f5-5d18-48a2-9e80-803c7890c30c
[I 2021-09-23 17:54:40,580] Trial 0 finished with value: 0.9525229001707809 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 3.0117830841670483}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,684] Trial 1 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 15275857.681467317}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,783] Trial 2 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 297357093.2564345}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:40,884] Trial 3 finished with value: 0.6274181027790716 and parameters: {'classifier': 'SVC', 'svc_c': 241127.03816762505}. Best is trial 0 with value: 0.9525229001707809.
[I 2021-09-23 17:54:41,005] Trial 4 finished with value: 0.9543238627542306 and parameters: {'classifier': 'RandomForest', 'rf_max_depth': 6.143111746204174}. Best is trial 4 with value: 0.9543238627542306.
The study object saves the result of each trial, i.e. the objective value obtained for one choice of hyperparameters. In the above study, the problem of model selection is framed as a hyperparameter optimization problem: we choose between an SVM-based classifier and a Random Forest.
study.trials_dataframe().head()
| | number | value | datetime_start | datetime_complete | duration | params_classifier | params_rf_max_depth | params_svc_c | state |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.952523 | 2021-09-23 17:54:39.425578 | 2021-09-23 17:54:40.579898 | 0 days 00:00:01.154320 | RandomForest | 3.011783 | NaN | COMPLETE |
| 1 | 1 | 0.627418 | 2021-09-23 17:54:40.582807 | 2021-09-23 17:54:40.684042 | 0 days 00:00:00.101235 | SVC | NaN | 1.527586e+07 | COMPLETE |
| 2 | 2 | 0.627418 | 2021-09-23 17:54:40.685710 | 2021-09-23 17:54:40.782860 | 0 days 00:00:00.097150 | SVC | NaN | 2.973571e+08 | COMPLETE |
| 3 | 3 | 0.627418 | 2021-09-23 17:54:40.784771 | 2021-09-23 17:54:40.884335 | 0 days 00:00:00.099564 | SVC | NaN | 2.411270e+05 | COMPLETE |
| 4 | 4 | 0.954324 | 2021-09-23 17:54:40.886240 | 2021-09-23 17:54:41.004901 | 0 days 00:00:00.118661 | RandomForest | 6.143112 | NaN | COMPLETE |
Fine tuning Random Forest
Here we focus on tuning a single Random Forest model. We then plot the accuracy for each pair of hyperparameters.
def objective(trial):
    max_depth = trial.suggest_int('max_depth', 2, 128, log=True)
    max_features = trial.suggest_float('max_features', 0.1, 1.0)
    n_estimators = trial.suggest_int('n_estimators', 100, 800)
    clf = ensemble.RandomForestClassifier(
        max_depth=max_depth,
        n_estimators=n_estimators,
        max_features=max_features,
        random_state=42)
    score = model_selection.cross_val_score(clf, X, y, n_jobs=-1, cv=5)
    return score.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
[I 2021-09-23 17:54:41,088] A new study created in memory with name: no-name-1118f726-3850-4750-9359-55c0ea45a8f8
[I 2021-09-23 17:54:48,956] Trial 0 finished with value: 0.9578481602235678 and parameters: {'max_depth': 8, 'max_features': 0.13883223749967205, 'n_estimators': 581}. Best is trial 0 with value: 0.9578481602235678.
[I 2021-09-23 17:54:55,549] Trial 1 finished with value: 0.9596491228070174 and parameters: {'max_depth': 119, 'max_features': 0.8704051169259739, 'n_estimators': 188}. Best is trial 1 with value: 0.9596491228070174.
[I 2021-09-23 17:55:13,415] Trial 2 finished with value: 0.9596180717279925 and parameters: {'max_depth': 44, 'max_features': 0.6700603339499863, 'n_estimators': 698}. Best is trial 1 with value: 0.9596491228070174.
[I 2021-09-23 17:55:20,477] Trial 3 finished with value: 0.9613724576929048 and parameters: {'max_depth': 9, 'max_features': 0.228797623879799, 'n_estimators': 717}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:29,341] Trial 4 finished with value: 0.9613724576929048 and parameters: {'max_depth': 29, 'max_features': 0.6566210378425625, 'n_estimators': 580}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:31,665] Trial 5 finished with value: 0.9578792113025927 and parameters: {'max_depth': 10, 'max_features': 0.9737412231250336, 'n_estimators': 119}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:34,767] Trial 6 finished with value: 0.9490607048594939 and parameters: {'max_depth': 2, 'max_features': 0.16966704315981518, 'n_estimators': 371}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:40,593] Trial 7 finished with value: 0.956078248719143 and parameters: {'max_depth': 4, 'max_features': 0.16266334215164907, 'n_estimators': 668}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:42,917] Trial 8 finished with value: 0.9578326346840553 and parameters: {'max_depth': 57, 'max_features': 0.16917559497644086, 'n_estimators': 243}. Best is trial 3 with value: 0.9613724576929048.
[I 2021-09-23 17:55:48,998] Trial 9 finished with value: 0.95960254618848 and parameters: {'max_depth': 6, 'max_features': 0.2007403313381485, 'n_estimators': 615}. Best is trial 3 with value: 0.9613724576929048.
study.best_params
{'max_depth': 9, 'max_features': 0.228797623879799, 'n_estimators': 717}
study.best_value
0.9613724576929048
Sampling algorithms
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=3)

def plot_results(study, p1, p2, j, cb):
    study.trials_dataframe().plot(
        kind='scatter', ax=axes[j], x=p1, y=p2,
        c='value', s=60, cmap=plt.get_cmap("jet"),
        colorbar=cb, label="accuracy", figsize=(16, 4)
    )

plot_results(study, 'params_max_depth', 'params_n_estimators', j=0, cb=False)
plot_results(study, 'params_max_depth', 'params_max_features', j=1, cb=False)
plot_results(study, 'params_n_estimators', 'params_max_features', j=2, cb=True);
Figure. TPE in action. Optuna uses the Tree-structured Parzen Estimator (TPE) [BBBK11], a form of Bayesian optimization, as the default sampler. Observe that the hyperparameter space is searched more efficiently than with a random search, with the sampler choosing points closer to previous good results. Samplers are specified when creating a study:
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
From the docs:
On each trial, for each parameter, TPE fits one Gaussian Mixture Model (GMM) l(x) to the set of parameter values associated with the best objective values, and another GMM g(x) to the remaining parameter values. It chooses the parameter value x that maximizes the ratio l(x)/g(x).
Thus, TPE samples every hyperparameter independently: no explicit hyperparameter interactions are considered when sampling future trials, although other parameters implicitly affect the objective value. Optuna also implements our old friends, random and grid search, in the following samplers:
optuna.samplers.GridSampler
optuna.samplers.RandomSampler
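The selection rule quoted above, fit l(x) to the best values, fit g(x) to the rest, and pick the x that maximizes l(x)/g(x), can be sketched for a single parameter with the standard library alone. This is a simplified illustration under stated assumptions, not Optuna's TPESampler: the fixed bandwidth, the `gamma` quantile, the candidate count, and the names `parzen_density` and `tpe_suggest` are all hypothetical choices made for the example.

```python
import math
import random


def parzen_density(x, centers, bandwidth):
    """Equal-weight mixture of Gaussians centred on the observations."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) / norm for c in centers)
    return total / len(centers)


def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=0.5, seed=0):
    """Suggest the next value of one parameter from (value, score) pairs.

    Scores are maximized. gamma, n_candidates and bandwidth are
    illustrative defaults, not the ones Optuna uses.
    """
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda pair: pair[1], reverse=True)
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]  # observations behind l(x)
    rest = [x for x, _ in ranked[n_good:]]  # observations behind g(x)

    # Draw candidates from l(x): pick a good point, add Gaussian noise.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    if not rest:  # nothing to contrast against yet
        return candidates[0]

    def ratio(x):
        g = parzen_density(x, rest, bandwidth)
        return parzen_density(x, good, bandwidth) / (g + 1e-12)

    return max(candidates, key=ratio)


# Toy history for one parameter: the score peaks near x = 2.
history = [(x, -(x - 2.0) ** 2) for x in [-4.0, -3.0, -1.0, 0.0, 1.5, 2.5, 4.0, 5.0]]
suggestion = tpe_suggest(history)
```

Because the rule is applied to each parameter separately, the sketch also shows why no explicit interaction between parameters enters the sampling step.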
Results from the paper [ASY+19]:
TPE+CMA-ES sampling can be implemented as follows:
sampler = optuna.samplers.CmaEsSampler(
    warn_independent_sampling=False,
    independent_sampler=optuna.samplers.TPESampler()
)
This uses the CMA-ES algorithm [Han16] with TPE for searching dynamically constructed hyperparameters (as CMA-ES requires that parameters are specified prior to the optimization).
Visualizations
First define a helper function for displaying plotly plots as HTML.
from IPython.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
config = {'showLink': False, 'displayModeBar': False}
fig_count = 0

# See https://github.com/executablebooks/jupyter-book/issues/93
# Solves issue of having blank plotly plots in the build. No need to
# save the generated HTML files. Probably embedded into the notebook.
def plot_html(fig):
    global fig_count
    plot(fig, filename=f'optuna-{fig_count}.html', config=config)
    display(HTML(f'optuna-{fig_count}.html'))
    fig_count += 1
Optuna provides visualization functions in the optuna.visualization library 2. The following plot shows the best objective value found as the trials progress. The increasing trend in accuracy indicates that the TPE sampler is working well, i.e. the search algorithm learns from previous trials.
optuna.visualization.plot_optimization_history(study)
The parallel coordinate plot gives us a feel of how the hyperparameters interact. For instance, max_features around 0.5 with n_estimators around 280 and max_depth around 20 generally perform well. This setting includes the best performing hyperparameters. To isolate subsets of lines, use the interactive capabilities of the plot below by dragging on each axis to restrict it. See figure immediately below.
optuna.visualization.plot_parallel_coordinate(study)
Fig. 8 Using sliders to restrict values for certain parameters.
Slice plots project the path of the optimizer in the hyperparameter space onto each dimension, with each point placed along the \(y\)-axis according to its objective value. A large spread of dark dots indicates that a large range of values of that hyperparameter is feasible even at later stages. Meanwhile, a small spread means that the sampler focuses on a small part of the search space; in this case, other hyperparameters implicitly improve the objective. For example, the parameter max_features is explored over a wide range even at later trials. Hence, we think of this hyperparameter as important. Indeed, the importance plot below supports this.
plot_html(optuna.visualization.plot_slice(study, params=['n_estimators', 'max_depth', 'max_features']))
By default, the hyperparameter importance evaluator in Optuna is optuna.importance.FanovaImportanceEvaluator. This takes as input performance data gathered with different hyperparameter settings of the algorithm, fits a random forest to capture the relationship between hyperparameters and performance, and then applies functional ANOVA to assess how important each of the hyperparameters and each low-order interaction of hyperparameters is to performance [HHLB14]. From the docs:
The performance of fANOVA depends on the prediction performance of the underlying random forest model. In order to obtain high prediction performance, it is necessary to cover a wide range of the hyperparameter search space. It is recommended to use an exploration-oriented sampler such as RandomSampler.
fig = optuna.visualization.plot_param_importances(study)
fig.update_layout(width=600, height=350)
plot_html(fig)
To visualize interactions of any pair of hyperparameters, we use contour plots. The contour plots indicate regions of high and low objective value.
fig = optuna.visualization.plot_contour(study, params=["max_depth", "max_features"])
fig.update_layout(width=550, height=500)
plot_html(fig)
Neural networks
As noted above, we should always perform tuning within a cross-validation framework. However, with neural networks, 5-fold CV would require too much compute time and hence too many resources, e.g. GPU usage. Instead, we perform tuning on a hold-out validation set and hope for the best.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.utils.data import Dataset, DataLoader
from sklearn import model_selection
from sklearn.datasets import fetch_openml
from tqdm import tqdm
import optuna
import numpy as np
Define a simple network.
class MLPClassifier(nn.Module):
    """
    Neural network with multiple hidden fully-connected layers with ReLU
    activation and dropout.
    """
    def __init__(self, input_size, num_classes, n_layers, out_features, drop_rate):
        super().__init__()
        layers = []
        in_features = input_size
        for i in range(n_layers):
            m = nn.Linear(in_features, out_features[i])
            nn.init.kaiming_normal_(m.weight)
            nn.init.constant_(m.bias, 0)
            layers.append(m)
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(drop_rate))
            in_features = out_features[i]
        layers.append(nn.Linear(in_features, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
We also define a Dataset class for MNIST.
class MNISTDataset(Dataset):
    def __init__(self, features, targets, transform=None):
        self.features = features
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, i):
        X = self.features[i, :]
        y = self.targets[i]
        if self.transform is not None:
            X = self.transform(X)
        return X, y
Define a trainer for the neural network model. This will handle all loss and metric evaluation, as well as backpropagation.
class Engine:
    """Neural network trainer."""
    def __init__(self, model, device, optimizer):
        self.model = model
        self.device = device
        self.optimizer = optimizer

    @staticmethod
    def loss_fn(outputs, targets):
        return nn.CrossEntropyLoss()(outputs, targets)

    def train(self, data_loader):
        """Train model on one epoch. Return train loss."""
        self.model.train()
        loss = 0
        for i, (data, targets) in enumerate(data_loader):
            data = data.to(self.device).reshape(data.shape[0], -1).float()
            targets = targets.to(self.device).long()
            # Forward pass
            outputs = self.model(data)
            J = self.loss_fn(outputs, targets)
            # Backward pass
            self.optimizer.zero_grad()
            J.backward()
            self.optimizer.step()
            # Running mean of the batch losses
            loss += (J.detach().item() - loss) / (i + 1)
        return loss

    def eval(self, data_loader):
        """Return validation loss and validation accuracy."""
        self.model.eval()
        num_correct = 0
        num_samples = 0
        loss = 0.0
        with torch.no_grad():
            for i, (data, targets) in enumerate(data_loader):
                data = data.to(self.device).float()
                targets = targets.to(self.device)
                # Forward pass
                data = data.reshape(data.shape[0], -1)
                out = self.model(data)
                J = self.loss_fn(out, targets)
                _, preds = out.max(dim=1)
                # Running mean of the batch losses and accuracy counts
                loss += (J.detach().item() - loss) / (i + 1)
                num_correct += (preds == targets).sum().item()
                num_samples += preds.shape[0]
        acc = num_correct / num_samples
        return loss, acc
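The update `loss += (J - loss) / (i + 1)` used in both `train` and `eval` keeps a running mean of the per-batch losses without storing them. The short check below, with hypothetical loss values and a helper name (`running_mean`) chosen only for illustration, confirms that it agrees with the ordinary arithmetic mean.

```python
def running_mean(values):
    """Incremental mean, updated the same way as `loss` in Engine."""
    mean = 0.0
    for i, v in enumerate(values):
        mean += (v - mean) / (i + 1)
    return mean


batch_losses = [0.9, 0.7, 0.4, 0.2]  # hypothetical per-batch losses
avg = running_mean(batch_losses)
plain = sum(batch_losses) / len(batch_losses)
```

Note that every batch gets equal weight, so when the last batch is smaller than the others the result is a mean over batches rather than an exact mean over samples.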
Some config and setup prior to training. For our dataset, we use MNIST which we get from scikit-learn.
# Config
RANDOM_STATE = 42
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
EPOCHS = 100
PATIENCE = 5
INPUT_SIZE = 784
NUM_CLASSES = 10

# Fetch data (as_frame=False returns NumPy arrays, which .reshape requires)
MNIST = fetch_openml("mnist_784", as_frame=False)
X = MNIST['data'].reshape(-1, 28, 28)
y = MNIST['target'].astype(int)

# Create folds
cv = model_selection.StratifiedKFold(n_splits=5)
trn_, val_ = next(iter(cv.split(X=X, y=y)))

# Get train and valid data loaders
train_dataset = MNISTDataset(X[trn_, :], y[trn_], transform=transforms.ToTensor())
valid_dataset = MNISTDataset(X[val_, :], y[val_], transform=transforms.ToTensor())
Intermediate values
Finally, we set up the study instance and its objective function. Note that the search space is dynamically constructed depending on the number of layers (i.e. an earlier suggestion for a hyperparameter). During training, we perform early stopping on validation loss. If no new minimum val. loss is found after 5 epochs, then the minimum val. loss is returned as the objective 3.
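The early stopping rule just described can be isolated from the training loop. The sketch below replays it on a hypothetical list of validation losses; `early_stopping` and `patience_limit` are illustrative names, and the counter logic mirrors the `PATIENCE` handling in the objective.

```python
def early_stopping(losses, patience_limit=5):
    """Return (best_loss, epochs_run) for a sequence of validation losses."""
    best_loss = float("inf")
    patience = patience_limit
    epochs_run = 0
    for loss in losses:
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss           # new minimum: reset the counter
            patience = patience_limit
        else:
            patience -= 1              # no improvement this epoch
            if patience == 0:
                break
    return best_loss, epochs_run


# Hypothetical losses: the minimum 0.35 is followed by five epochs
# without improvement, so the final 0.30 is never reached.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.30]
best, epochs = early_stopping(val_losses)
```

The example also shows the cost of the rule: a later improvement is missed once the patience budget is spent.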
Computing intermediate values allows us to prune unpromising trials to conserve resources. The default pruner in Optuna is optuna.pruners.MedianPruner, which prunes a trial if its best intermediate result as of the current step (e.g. current best valid loss) is worse than the median of the intermediate results of previous trials at the same step. Hence, a pruned trial is one whose best result so far is worse than what half of the previous trials reported at that step. In our case, if the minimum val. loss does not improve quickly enough, then the trial is pruned. Of course, the validation loss could descend rapidly at later steps, but the median pruner does not bet on this happening.
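As a rough illustration of the median rule for a minimized objective, the following standard-library sketch compares a running trial against stored histories of earlier trials. It is a simplification under stated assumptions rather than Optuna's MedianPruner: the function name `should_prune`, the list-of-lists data layout, and the example values are hypothetical, and details such as warm-up steps are omitted.

```python
from statistics import median


def should_prune(current_values, previous_trials, step, n_startup_trials=5):
    """Median rule for a minimized objective.

    current_values: intermediate values of the running trial (steps 0..step).
    previous_trials: one list of intermediate values per earlier trial.
    """
    if len(previous_trials) < n_startup_trials:
        return False  # not enough history yet, never prune
    # Values that earlier trials reported at this same step.
    at_step = [vals[step] for vals in previous_trials if len(vals) > step]
    if not at_step:
        return False
    best_so_far = min(current_values[: step + 1])
    return best_so_far > median(at_step)


# Hypothetical validation losses of five earlier trials (steps 0, 1, 2).
previous = [
    [0.90, 0.50, 0.30],
    [0.80, 0.60, 0.40],
    [1.00, 0.70, 0.50],
    [0.70, 0.40, 0.20],
    [0.90, 0.60, 0.45],
]
slow_trial = should_prune([0.95, 0.80], previous, step=1)      # 0.80 vs median 0.60
good_trial = should_prune([0.60, 0.55], previous, step=1)      # 0.55 vs median 0.60
too_early = should_prune([0.95, 0.80], previous[:4], step=1)   # only 4 earlier trials
```

The `n_startup_trials` guard plays the role of a start-up period: no trial is pruned until enough earlier trials exist to make the median meaningful.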
def define_model(trial):
    # Optimize the # of layers, hidden units and dropout ratio in each layer.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    out_features = []
    drop_rate = trial.suggest_float('dropout_rate', 0.2, 0.5)
    for i in range(n_layers):
        out_features.append(trial.suggest_int("n_units_l{}".format(i), 4, 128))
    return MLPClassifier(INPUT_SIZE, NUM_CLASSES, n_layers, out_features, drop_rate)

def objective(trial):
    model = define_model(trial).to(DEVICE)
    batch_size = trial.suggest_int('batch_size', 8, 512, log=True)
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
    learning_rate = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float('weight_decay', 0.0, 0.5)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3)
    engine = Engine(model, DEVICE, optimizer)

    # Init. dataloaders
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=True)

    # Run training
    best_loss = np.inf
    patience = PATIENCE
    for epoch in tqdm(range(EPOCHS), total=EPOCHS, leave=False):
        # Train and validation step
        train_loss = engine.train(train_loader)
        valid_loss, valid_acc = engine.eval(valid_loader)

        # Reduce learning rate
        if scheduler is not None:
            scheduler.step(valid_loss)

        # Early stopping
        if valid_loss < best_loss:
            best_loss = valid_loss
            patience = PATIENCE
        else:
            patience -= 1
            if patience == 0:
                break

        # Pruning unpromising trials
        trial.report(valid_loss, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return best_loss

# Create and run optimization problem
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=60)
[I 2021-09-23 16:35:45,054] A new study created in memory with name: no-name-12007090-de83-452d-95f7-6afe312869a9
[I 2021-09-23 16:36:55,854] Trial 0 finished with value: 0.21022988284386487 and parameters: {'n_layers': 2, 'dropout_rate': 0.27948169944648965, 'n_units_l0': 88, 'n_units_l1': 115, 'batch_size': 137, 'lr': 0.008514613889742123, 'weight_decay': 0.3670069593807663}. Best is trial 0 with value: 0.21022988284386487.
[I 2021-09-23 16:42:48,011] Trial 1 finished with value: 0.08726624973227211 and parameters: {'n_layers': 1, 'dropout_rate': 0.4338336474302041, 'n_units_l0': 125, 'batch_size': 21, 'lr': 0.0026570069013367647, 'weight_decay': 0.18948308621002236}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:44:59,376] Trial 2 finished with value: 0.1323056104992117 and parameters: {'n_layers': 2, 'dropout_rate': 0.21647904438644167, 'n_units_l0': 104, 'n_units_l1': 75, 'batch_size': 201, 'lr': 2.821174991898835e-05, 'weight_decay': 0.1794091322194259}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:45:53,755] Trial 3 finished with value: 0.22288295945950917 and parameters: {'n_layers': 1, 'dropout_rate': 0.47440500392496415, 'n_units_l0': 28, 'batch_size': 250, 'lr': 0.0001789132830643754, 'weight_decay': 0.27924720478580706}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:48:09,581] Trial 4 finished with value: 2.3010145167897114 and parameters: {'n_layers': 3, 'dropout_rate': 0.28055390550281156, 'n_units_l0': 113, 'n_units_l1': 25, 'n_units_l2': 21, 'batch_size': 13, 'lr': 0.014257299782954121, 'weight_decay': 0.13705041929131956}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:51:34,557] Trial 5 pruned.
[I 2021-09-23 16:52:43,213] Trial 6 finished with value: 0.1177581399365255 and parameters: {'n_layers': 1, 'dropout_rate': 0.4482671105170357, 'n_units_l0': 113, 'batch_size': 79, 'lr': 0.00015027733498346874, 'weight_decay': 0.3609538112622472}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:52:53,391] Trial 7 pruned.
[I 2021-09-23 16:53:05,382] Trial 8 pruned.
[I 2021-09-23 16:54:19,755] Trial 9 finished with value: 0.1301146842344524 and parameters: {'n_layers': 1, 'dropout_rate': 0.43234779233219023, 'n_units_l0': 77, 'batch_size': 92, 'lr': 0.0001244240643891887, 'weight_decay': 0.42219789050012224}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:54:25,639] Trial 10 pruned.
[I 2021-09-23 16:55:02,303] Trial 11 finished with value: 0.10693867458030581 and parameters: {'n_layers': 1, 'dropout_rate': 0.38676795279677584, 'n_units_l0': 120, 'batch_size': 445, 'lr': 0.0009051925392143345, 'weight_decay': 0.30326382618609626}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:55:51,786] Trial 12 pruned.
[I 2021-09-23 16:56:26,328] Trial 13 finished with value: 0.12493549846112731 and parameters: {'n_layers': 1, 'dropout_rate': 0.3550429345374202, 'n_units_l0': 126, 'batch_size': 503, 'lr': 0.0012205051681880804, 'weight_decay': 0.4696091813861093}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:56:42,622] Trial 14 pruned.
[I 2021-09-23 16:56:43,755] Trial 15 pruned.
[I 2021-09-23 16:56:48,863] Trial 16 pruned.
[I 2021-09-23 16:56:53,308] Trial 17 pruned.
[I 2021-09-23 16:57:10,668] Trial 18 pruned.
[I 2021-09-23 16:57:18,708] Trial 19 pruned.
[I 2021-09-23 16:57:19,972] Trial 20 pruned.
[I 2021-09-23 16:57:22,372] Trial 21 pruned.
[I 2021-09-23 16:58:29,770] Trial 22 finished with value: 0.11929829030444739 and parameters: {'n_layers': 1, 'dropout_rate': 0.4505547548269105, 'n_units_l0': 127, 'batch_size': 74, 'lr': 0.00017875642777059658, 'weight_decay': 0.4075948726688359}. Best is trial 1 with value: 0.08726624973227211.
[I 2021-09-23 16:58:42,515] Trial 23 pruned.
[I 2021-09-23 16:59:12,460] Trial 24 pruned.
[I 2021-09-23 16:59:14,469] Trial 25 pruned.
[I 2021-09-23 17:00:27,573] Trial 26 finished with value: 0.08047253501394556 and parameters: {'n_layers': 1, 'dropout_rate': 0.36885178395967316, 'n_units_l0': 108, 'batch_size': 196, 'lr': 0.00031607870752817664, 'weight_decay': 0.14341725932871502}. Best is trial 26 with value: 0.08047253501394556.
[I 2021-09-23 17:00:28,679] Trial 27 pruned.
[I 2021-09-23 17:00:30,182] Trial 28 pruned.
[I 2021-09-23 17:00:31,414] Trial 29 pruned.
[I 2021-09-23 17:00:43,521] Trial 30 pruned.
[I 2021-09-23 17:00:44,725] Trial 31 pruned.
[I 2021-09-23 17:02:07,400] Trial 32 pruned.
[I 2021-09-23 17:02:08,869] Trial 33 pruned.
[I 2021-09-23 17:02:10,157] Trial 34 pruned.
[I 2021-09-23 17:06:37,772] Trial 35 finished with value: 0.08286822465960587 and parameters: {'n_layers': 1, 'dropout_rate': 0.2566693607067595, 'n_units_l0': 92, 'batch_size': 33, 'lr': 0.00023038433623588474, 'weight_decay': 0.17907560196864042}. Best is trial 26 with value: 0.08047253501394556.
[I 2021-09-23 17:07:00,968] Trial 36 pruned.
[I 2021-09-23 17:07:11,660] Trial 37 pruned.
[I 2021-09-23 17:11:14,050] Trial 38 finished with value: 0.07970190724682759 and parameters: {'n_layers': 1, 'dropout_rate': 0.2550241291506312, 'n_units_l0': 107, 'batch_size': 27, 'lr': 0.000263494124965755, 'weight_decay': 0.16351684276444545}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:11:20,992] Trial 39 pruned.
[I 2021-09-23 17:11:26,819] Trial 40 pruned.
[I 2021-09-23 17:12:09,104] Trial 41 pruned.
[I 2021-09-23 17:18:54,125] Trial 42 finished with value: 0.09131516358004889 and parameters: {'n_layers': 1, 'dropout_rate': 0.3099118013642247, 'n_units_l0': 122, 'batch_size': 13, 'lr': 0.00012731056966105474, 'weight_decay': 0.2348841683626428}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:27:46,436] Trial 43 pruned.
[I 2021-09-23 17:29:19,866] Trial 44 pruned.
[I 2021-09-23 17:29:28,206] Trial 45 pruned.
[I 2021-09-23 17:37:31,618] Trial 46 finished with value: 0.08286373784078745 and parameters: {'n_layers': 1, 'dropout_rate': 0.2369304595931093, 'n_units_l0': 116, 'batch_size': 14, 'lr': 9.466146482797484e-05, 'weight_decay': 0.20151296294549714}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:37:35,168] Trial 47 pruned.
[I 2021-09-23 17:37:39,402] Trial 48 pruned.
[I 2021-09-23 17:39:33,132] Trial 49 finished with value: 0.0823763620058915 and parameters: {'n_layers': 1, 'dropout_rate': 0.20064752116941653, 'n_units_l0': 104, 'batch_size': 62, 'lr': 0.00040245052952413314, 'weight_decay': 0.18911575126002592}. Best is trial 38 with value: 0.07970190724682759.
[I 2021-09-23 17:39:35,717] Trial 50 pruned.
[I 2021-09-23 17:40:02,287] Trial 51 pruned.
[I 2021-09-23 17:40:29,191] Trial 52 pruned.
[I 2021-09-23 17:41:12,289] Trial 53 pruned.
[I 2021-09-23 17:41:28,513] Trial 54 pruned.
[I 2021-09-23 17:44:22,251] Trial 55 finished with value: 0.06355256314022178 and parameters: {'n_layers': 1, 'dropout_rate': 0.25711536163064735, 'n_units_l0': 117, 'batch_size': 57, 'lr': 0.0003044584963559317, 'weight_decay': 0.048125362770925995}. Best is trial 55 with value: 0.06355256314022178.
[I 2021-09-23 17:44:25,394] Trial 56 pruned.
[I 2021-09-23 17:44:28,221] Trial 57 pruned.
[I 2021-09-23 17:44:42,776] Trial 58 pruned.
[I 2021-09-23 17:44:48,257] Trial 59 pruned.
from optuna.trial import TrialState

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print(" Number of finished trials:\t", len(study.trials))
print(" Number of pruned trials:\t", len(pruned_trials))
print(" Number of complete trials:\t", len(complete_trials))

print("\nBest trial:")
trial = study.best_trial
print(" Value: ", trial.value)
print(" Params: ")
for key, value in trial.params.items():
    print(" {}: {}".format(key, value))
Study statistics:
Number of finished trials: 60
Number of pruned trials: 43
Number of complete trials: 17
Best trial:
Value: 0.06355256314022178
Params:
n_layers: 1
dropout_rate: 0.25711536163064735
n_units_l0: 117
batch_size: 57
lr: 0.0003044584963559317
weight_decay: 0.048125362770925995
The trials below either stop early (gradient descent loses momentum) or get pruned (unlikely to improve even if gradient descent continues). Note that pruning starts at Trial 5. This can be tweaked with the n_startup_trials=5 parameter of the pruner: pruning is disabled until 5 trials have finished in the same study. This is so that the pruner obtains enough information about the behavior of the gradient descent optimizer before starting to prune.
plot_html(optuna.visualization.plot_intermediate_values(study))
plot_html(optuna.visualization.plot_optimization_history(study))
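To make the role of n_startup_trials concrete, here is a minimal sketch of a median-style pruning rule in plain Python. It is not Optuna's implementation: the function should_prune, its arguments, and the toy loss curves are hypothetical, and we assume a pruner that compares a running trial against the median of finished trials at the same step, with lower values being better.

```python
from statistics import median


def should_prune(history, current, step, n_startup_trials=5):
    """Simplified median rule (illustrative only, not Optuna's code).

    history: intermediate values (lower is better) of already finished trials,
             one list per trial.
    current: intermediate values reported so far by the running trial.
    step:    index of the latest reported step.
    """
    # Pruning stays disabled until enough trials have finished.
    if len(history) < n_startup_trials:
        return False
    # Compare only against finished trials that reached this step.
    peers = [values[step] for values in history if len(values) > step]
    if not peers:
        return False
    # Prune when the trial's best value so far is worse than the peers' median.
    return min(current[: step + 1]) > median(peers)


# Toy loss curves for four finished trials and one running trial.
finished = [[0.9, 0.5, 0.3], [0.8, 0.6, 0.4], [1.0, 0.7, 0.5], [0.9, 0.4, 0.2]]
running = [1.2, 0.9]

early = should_prune(finished, running, step=1)  # only 4 finished trials
finished.append([0.7, 0.5, 0.35])
late = should_prune(finished, running, step=1)   # now 5 finished trials
print(early, late)
```

With only four finished trials the rule never prunes, no matter how poor the running trial looks. Once a fifth trial has finished, the same running trial is pruned because its best loss so far is above the median at that step.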
Hyperparameter interactions¶
We use the parallel coordinate plot to see which combinations of hyperparameters work well. Note that something odd is going on here: trials with n_layers=1 have coordinates on axes where they should have no values, e.g. n_units_l1 and n_units_l2. This is a known issue with parallel coordinate plots, e.g. #1809. Ideally, the plotter would skip lines for trials whose dynamically constructed parameters are missing (NaN). Moreover, trials with NaN values are excluded from the parameter importance computation, which limits its usefulness.
plot_html(optuna.visualization.plot_parallel_coordinate(study))
study.trials_dataframe().head()
| | number | value | datetime_start | datetime_complete | duration | params_batch_size | params_dropout_rate | params_lr | params_n_layers | params_n_units_l0 | params_n_units_l1 | params_n_units_l2 | params_weight_decay | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.210230 | 2021-09-23 16:35:45.058280 | 2021-09-23 16:36:55.853553 | 0 days 00:01:10.795273 | 137 | 0.279482 | 0.008515 | 2 | 88 | 115.0 | NaN | 0.367007 | COMPLETE |
| 1 | 1 | 0.087266 | 2021-09-23 16:36:55.857589 | 2021-09-23 16:42:48.010031 | 0 days 00:05:52.152442 | 21 | 0.433834 | 0.002657 | 1 | 125 | NaN | NaN | 0.189483 | COMPLETE |
| 2 | 2 | 0.132306 | 2021-09-23 16:42:48.014477 | 2021-09-23 16:44:59.375034 | 0 days 00:02:11.360557 | 201 | 0.216479 | 0.000028 | 2 | 104 | 75.0 | NaN | 0.179409 | COMPLETE |
| 3 | 3 | 0.222883 | 2021-09-23 16:44:59.381142 | 2021-09-23 16:45:53.755409 | 0 days 00:00:54.374267 | 250 | 0.474405 | 0.000179 | 1 | 28 | NaN | NaN | 0.279247 | COMPLETE |
| 4 | 4 | 2.301015 | 2021-09-23 16:45:53.757336 | 2021-09-23 16:48:09.580522 | 0 days 00:02:15.823186 | 13 | 0.280554 | 0.014257 | 3 | 113 | 25.0 | 21.0 | 0.137050 | COMPLETE |
study.trials_dataframe().query("state=='COMPLETE'").params_n_layers.value_counts()
1 14
2 2
3 1
Name: params_n_layers, dtype: int64
Instead, we can look at the subset of trials for each value of n_layers. The resulting trials have no NaN parameters, since the parameters are sampled after a value for n_layers has been suggested. It looks like n_layers=1 works best.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Isolate a study for each value of n_layers
studies = [optuna.create_study() for j in range(3)]
for j in range(3):
    studies[j].add_trials([t for t in study.trials if t.params['n_layers'] == j+1])
    fig = optuna.visualization.plot_parallel_coordinate(studies[j])
    plot_html(fig)
[I 2021-09-23 17:44:51,716] A new study created in memory with name: no-name-6ca29b5a-3524-441a-b429-26b0a39e11f5
[I 2021-09-23 17:44:51,722] A new study created in memory with name: no-name-dacfd576-1b50-4842-8153-2932fe6db7eb
[I 2021-09-23 17:44:51,728] A new study created in memory with name: no-name-7183fded-a3b8-4c5d-b803-241191ad9c25
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:7: ExperimentalWarning:
add_trials is experimental (supported from v2.5.0). The interface can change in the future.
/usr/local/lib/python3.7/dist-packages/optuna/study/study.py:969: ExperimentalWarning:
add_trial is experimental (supported from v2.0.0). The interface can change in the future.
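The following sketch illustrates why splitting by n_layers removes the missing values. It uses plain dictionaries as hypothetical stand-ins for trial.params, with the layer sizes copied from the first five rows of the trials table above; the helper missing_cells is our own name, not part of Optuna.

```python
from collections import defaultdict

# Stand-ins for trial.params (layer-related parameters only).
trial_params = [
    {"n_layers": 2, "n_units_l0": 88, "n_units_l1": 115},
    {"n_layers": 1, "n_units_l0": 125},
    {"n_layers": 2, "n_units_l0": 104, "n_units_l1": 75},
    {"n_layers": 1, "n_units_l0": 28},
    {"n_layers": 3, "n_units_l0": 113, "n_units_l1": 25, "n_units_l2": 21},
]


def missing_cells(rows):
    """Count the cells a table over these rows would have to fill with NaN."""
    columns = set().union(*(row.keys() for row in rows))
    return sum(len(columns - set(row)) for row in rows)


# Group the trials by the value of n_layers that was suggested first.
groups = defaultdict(list)
for params in trial_params:
    groups[params["n_layers"]].append(params)

missing_all = missing_cells(trial_params)
missing_per_group = {k: missing_cells(v) for k, v in sorted(groups.items())}
print(missing_all, missing_per_group)
```

Over all five trials, six cells are missing, which matches the NaN entries in the table. Within each group every trial has the same set of parameters, so nothing is missing and each parallel coordinate plot has a value on every axis.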
From the following contour plot, we see that a low batch size is generally good, together with high values of dropout, learning rate, and weight decay, and only a single hidden layer. From the parallel plot above, a hidden layer of size around 90 looks good.
fig = optuna.visualization.plot_contour(study, params=['batch_size', 'lr', 'n_layers', 'weight_decay', 'dropout_rate'])
fig.update_layout(autosize=False, width=1200, height=1200)
plot_html(fig)
fig = optuna.visualization.plot_contour(study, params=['batch_size', 'lr'])
fig.show()
optuna.visualization.plot_optimization_history(study)
Appendix: Hyperparameters of commonly used models¶
- 1
Like all applied machine learning solutions.
- 2
See Optuna dashboard which displays the same plots that are updated in real-time.
- 3
In practice, we save the best model parameters at this point.



